Week 8: Multiple Regression

PSC 8101 Lab, Wed., Oct. 16

Load Packages and Data

library(haven)
library(tidyverse)
library(psych)
library(jtools)
library(skimr)

states <- read_dta("states.dta")

Simple Regression Results

model1 <- lm(womleg_2017 ~ conpct_m, data = states)
summ(model1)

Simple Regression Results

Observations 50
Dependent variable womleg_2017
Type OLS linear regression
F(1,48) 24.959
R² 0.342
Adj. R² 0.328
Est. S.E. t val. p
(Intercept) 53.465 5.760 9.282 0.000
conpct_m -0.837 0.168 -4.996 0.000
Standard errors: OLS
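The t values in the table are just the estimates divided by their standard errors; a quick check that reproduces the t column from the model object:

```r
# t value = estimate / standard error, row by row
co <- summary(model1)$coefficients
co[, "Estimate"] / co[, "Std. Error"]
```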

t Distribution for Statistical Significance

  • Use the t distribution rather than the normal: it accounts for the degrees of freedom (N - k) lost to estimation in finite samples.

t Distribution for Statistical Significance

# pt(tvalue, df) is the t distribution's cumulative distribution function:
# the probability of observing a value at or below tvalue
pt(-5, 48)
[1] 4.030833e-06
pt(-1.97, 48)
[1] 0.02731134
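Note that pt() returns a one-tail area. For the two-sided tests reported in the regression tables, double it:

```r
# two-sided p-value: double the lower-tail area of -|t|
2 * pt(-abs(-4.996), df = 48)  # rounds to the 0.000 shown for conpct_m
2 * pt(-abs(-1.97), df = 48)
```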

t Distribution for Statistical Significance

# inverse (quantile function): qt(pvalue, df) returns the t value
# with that lower-tail probability
qt(.025, 48)
[1] -2.010635
qt(.025, 148)
[1] -1.976122
qt(.025, 1000)
[1] -1.962339
qt(.025, 2000)
[1] -1.961151
qt(.025, 5000)
[1] -1.960439
qt(.025, 1000000)
[1] -1.959966
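As the degrees of freedom grow, the t critical value converges to the familiar normal one, which you can confirm with qnorm():

```r
# with infinite df, the t distribution is the standard normal
qnorm(.025)        # -1.959964
qt(.025, df = 48)  # wider critical value in a small sample
```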

Multiple Regression

  • Add church attendance and education; first, review the bivariate relationships:

% Conservative in Each State

states |>
  ggplot(aes(x=conpct_m, y=womleg_2017)) + 
  geom_point() +
  geom_smooth(method="lm", se=FALSE) 

% Church Attendance

states |>
  ggplot(aes(x=attend_pct, y=womleg_2017)) + 
  geom_point() +
  geom_smooth(method="lm", se=FALSE) 

% College Grads

states |>
  ggplot(aes(x=ba_or_more_2015, y=womleg_2017)) + 
  geom_point() +
  geom_smooth(method="lm", se=FALSE) 

Correlation Matrix

states |> 
  select(womleg_2017, conpct_m, ba_or_more_2015, attend_pct) |> 
  cor() |> 
  knitr::kable(digits=3)

Correlation Matrix

womleg_2017 conpct_m ba_or_more_2015 attend_pct
womleg_2017 1.000 -0.585 0.549 -0.659
conpct_m -0.585 1.000 -0.647 0.743
ba_or_more_2015 0.549 -0.647 1.000 -0.528
attend_pct -0.659 0.743 -0.528 1.000

Estimate Multiple Regression

a <- lm(womleg_2017 ~ conpct_m + attend_pct + ba_or_more_2015, data = states)
summ(a)

Estimate Multiple Regression

Observations 50
Dependent variable womleg_2017
Type OLS linear regression
F(3,46) 14.855
R² 0.492
Adj. R² 0.459
Est. S.E. t val. p
(Intercept) 31.937 11.743 2.720 0.009
conpct_m -0.097 0.251 -0.386 0.701
attend_pct -0.387 0.129 -3.008 0.004
ba_or_more_2015 0.385 0.209 1.838 0.073
Standard errors: OLS
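A useful companion to these p-values is confint(), which reports 95% confidence intervals for the same estimates; an interval that spans zero corresponds to a p-value above .05, as with conpct_m here:

```r
# 95% confidence intervals for the multiple regression coefficients
confint(a)
# conpct_m's interval includes zero, consistent with its p = 0.701
```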

Sidenote: Saving and Analyzing Residuals

# Save residuals and fitted values (yhats) as new columns in the states dataset
states <- states |> mutate(wres = resid(a), wyhat = fitted(a))

# Later, we'll use these for diagnostics, e.g.,: 
states |>
  ggplot(aes(y=wres, x=wyhat)) + geom_point()
# Add smoothed line; what assumption is this testing? 
states |>
  ggplot(aes(y=wres, x=wyhat)) + geom_point() + geom_smooth(se=FALSE)

Saving and Analyzing Residuals

# Add smoothed line; what assumption is this testing? 
states |>
  ggplot(aes(y=wres, x=wyhat)) + geom_point() + geom_smooth(se=FALSE)

Netting out Interpretation in Multiple Regression

  • Regress ba on other x’s
ba <- lm(ba_or_more_2015 ~ conpct_m + attend_pct, data = states)
# Save residuals
states <- states |> mutate(ba_res=ba$residuals)
# Regress womleg on ba_res; the coefficient matches the multiple regression,
# but the s.e. does not
c <- lm(womleg_2017 ~ ba_res, data = states)
summ(c, digits=3)

Netting out Interpretation in Multiple Regression

Observations 50
Dependent variable womleg_2017
Type OLS linear regression
F(1,48) 1.860
R² 0.037
Adj. R² 0.017
Est. S.E. t val. p
(Intercept) 25.032 1.072 23.358 0.000
ba_res 0.385 0.282 1.364 0.179
Standard errors: OLS
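The slide above residualizes only the regressor. If you also net the other x's out of the dependent variable, the simple regression of residuals on residuals reproduces the multiple-regression coefficient exactly (the Frisch-Waugh-Lovell theorem); a sketch, assuming states is loaded as above:

```r
# Frisch-Waugh-Lovell: net the other x's out of BOTH y and ba
womleg_res <- resid(lm(womleg_2017 ~ conpct_m + attend_pct, data = states))
ba_res2    <- resid(lm(ba_or_more_2015 ~ conpct_m + attend_pct, data = states))
coef(lm(womleg_res ~ ba_res2))["ba_res2"]  # same 0.385 as the full model
```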

Standardized Coefficients

  • First, manually standardize all your variables—both dependent and independent variables.
  • Use the “scale” function.
states <- states |> mutate(womleg_st = scale(womleg_2017), 
                   contpct_st = scale(conpct_m), 
                   attend_st = scale(attend_pct), 
                   ba_st = scale(ba_or_more_2015))

Standardized Coefficients

  • Let’s make sure it worked.
  • What should we expect when we check mean and s.d. for transformed variables?
skim(states, womleg_st, contpct_st, attend_st, ba_st)

Standardized Coefficients

Data summary
Name states
Number of rows 50
Number of columns 207
_______________________
Column type frequency:
numeric 4
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
womleg_st 0 1 0 1 -1.82 -0.73 -0.03 0.71 1.96 ▆▇▇▇▃
contpct_st 0 1 0 1 -2.17 -0.70 -0.04 0.80 2.41 ▂▇▇▇▁
attend_st 0 1 0 1 -1.81 -0.74 -0.10 0.81 2.25 ▂▇▅▅▂
ba_st 0 1 0 1 -2.02 -0.64 -0.05 0.55 2.31 ▂▆▇▃▂

Standardized Coefficients

  • Estimate a regression using the standardized variables. Just look at the coefficients, not the SEs.
  • Note that a model using standardized variables has no constant: because every variable now has mean zero, the intercept is exactly zero. Adding a “0” to the formula drops it.
lm(womleg_st ~ 0 + contpct_st + attend_st + ba_st, data=states)

Standardized Coefficients


Call:
lm(formula = womleg_st ~ 0 + contpct_st + attend_st + ba_st, 
    data = states)

Coefficients:
contpct_st   attend_st       ba_st  
  -0.06784    -0.47451     0.25430  
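Equivalently, each standardized coefficient is the raw coefficient rescaled by sd(x)/sd(y); a quick check against the unstandardized model a from earlier:

```r
# standardized slope = raw slope * sd(x) / sd(y)
coef(a)["ba_or_more_2015"] * sd(states$ba_or_more_2015) / sd(states$womleg_2017)
# ~0.254, matching ba_st above
```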

Standardized Coefficients

  • A quicker way to report standardized coefficients (without first transforming the variables) is to wrap each variable in scale() directly in the model call.
summ(
  lm(scale(womleg_2017) ~ 0 + scale(conpct_m) + scale(attend_pct) + 
     scale(ba_or_more_2015), data = states)
  , digits=3
)

Standardized Coefficients

Observations 50
Dependent variable scale(womleg_2017)
Type OLS linear regression
F(3,47) 15.178
R² 0.492
Adj. R² 0.460
Est. S.E. t val. p
scale(conpct_m) -0.068 0.174 -0.391 0.698
scale(attend_pct) -0.475 0.156 -3.041 0.004
scale(ba_or_more_2015) 0.254 0.137 1.858 0.069
Standard errors: OLS